Data Cleaning with Pandas

pandas
data-cleaning
preprocessing
A comprehensive guide to handling duplicates and outliers in Pandas DataFrames, with practical examples and best practices.
Author

Mohammed Adil Siraju

Published

September 19, 2025

Welcome to this tutorial on data cleaning using Pandas! Data cleaning is a crucial step in any data analysis workflow. In this notebook, we’ll cover two essential techniques: - Handling duplicates: Removing or managing repeated rows. - Detecting and removing outliers: Using statistical methods like IQR (Interquartile Range).

By the end, you’ll have practical skills to preprocess messy datasets effectively.

1. Setting Up and Creating Sample Data

First, let’s import Pandas and create a sample DataFrame to work with.

import pandas as pd

data1 = {
    'A': [1,2,2,3,3],
    'B': [4,5,5,6,7]
}

df1 = pd.DataFrame(data1)
df1
A B
0 1 4
1 2 5
2 2 5
3 3 6
4 3 7

2. Dealing with Duplicates

Duplicates can skew your analysis. Pandas provides easy methods to detect and remove them.

Checking for Duplicates

df1.duplicated().sum()
np.int64(1)
df1.drop_duplicates()
A B
0 1 4
1 2 5
3 3 6
4 3 7
df1.drop_duplicates(subset=['A'])
A B
0 1 4
1 2 5
3 3 6

3. Handling Outliers

Outliers are extreme values that can distort statistical analysis. We’ll use the IQR method to detect and filter them.

Creating Sample Data with Outliers


data2 = {
    'A': [1,2,2,3,11, 11],
    'B': [1,5,5,6,12, 25]
}

df2 = pd.DataFrame(data2)
df2
A B
0 1 1
1 2 5
2 2 5
3 3 6
4 11 12
5 11 25
df2.describe()
A B
count 6.000000 6.000000
mean 5.000000 9.000000
std 4.690416 8.602325
min 1.000000 1.000000
25% 2.000000 5.000000
50% 2.500000 5.500000
75% 9.000000 10.500000
max 11.000000 25.000000

Calculating IQR and Bounds

q_low = df2['B'].quantile(0.25)
q_high = df2['B'].quantile(0.75)

iqr = q_high - q_low
l_bound = q_low - 1.5 * iqr
u_bound = q_high + 1.5 * iqr  # Fixed: upper bound should be plus, not minus

print(f"Lower bound: {l_bound}")
print(f"Upper bound: {u_bound}")
df2
A B
0 1 1
1 2 5
2 2 5
3 3 6
4 11 12
5 11 25

Filtering Out Outliers

df_filtered = df2[(df2['B']>l_bound) & (df2['B']<u_bound)]
df_filtered
A B
0 1 1
1 2 5
2 2 5
3 3 6
4 11 12